Abstract


This study investigates the acquisition of the L2 French vowel /y/ in a
mobile-assisted learning environment, via the use of automatic speech
recognition (ASR). In particular, it addresses the question of whether
ASR-based pronunciation instruction using a mobile device can improve the
production and perception of French /y/. Forty-two elementary French students
participated in an experimental study in which they were assigned to one of
three groups: (1) the ASR Group, which used an ASR application on their mobile
devices to complete weekly pronunciation activities, with immediate written
(textual) feedback provided by the software and no human interaction; (2) the
Non-ASR Group, which completed the same weekly pronunciation activities in
individual weekly sessions but with a teacher who provided immediate oral
feedback using recasts and repetitions; and finally, (3) the Control Group,
which participated in weekly individual meetings ‘to practice their
conversation skills’ with a teacher who provided no pronunciation feedback.
The study followed a pretest/posttest design. According to the results of the
dependent samples t-tests, only the ASR group improved significantly in
production from pretest to posttest (p < 0.001), and none of the groups
improved in perception. The overall success of the ASR group on the production
measures suggests that this type of learning environment is propitious for the
development of segmental features such as /y/ in L2 French.

Learning L2 pronunciation


Introduction 

In the context of mobile devices such as smartphones, tablets, and media
players, ASR (Automatic Speech Recognition) is found in the form of
applications (apps) which identify the words that a person speaks into a
microphone and automatically convert them into readable text. Recent
developments in voice-to-text abilities have encouraged ASR’s implementation
in computer-assisted language learning (CALL; e.g., Aist, 1999; Cucchiarini et
al., 2009; Eskenazi, 1999; Hincks, 2003; Kim, 2006; Neri et al., 2008; Strik
et al., 2009). In the context of pronunciation teaching, researchers suggest
two possible applications for ASR (Dalby and Kewley-Port, 1999; Holland, 1999;
Mostow and Aist, 1999): (1) to teach pronunciation of a foreign language; and
(2) to assess students’ oral production. These applications have been adopted
in a variety of studies on the use of ASR in computer-assisted pronunciation
teaching (CAPT) at the segmental level in a second or foreign language (L2)
(e.g., Bondar et al., 2011; Cucchiarini et al., 2009; Dalby and Kewley-Port,
1999; Kawai and Hirose, 2000; Kim, 2006; LaRocca et al., 1999; Levis, 2007;
Mostow and Aist, 1999; Neri et al., 2006, 2008; Penning de Vries et al., 2014;
Strik et al., 2009, 2012). Unfortunately, possibly because prosodic
information is filtered out in ASR processing, the technology has not received
the same level of attention in the investigation of suprasegmental features
(Coniam, 2002; Honig et al., 2012; Kaltenboeck, 2002; Levis, 2007).

One of the interesting aspects of ASR is that it fulfills the criteria
proposed by Chapelle and Jamieson (2008) for selecting pronunciation software
and activities to develop oral skills. Specifically, ASR allows for: (1)
learner fit (ASR is useful for learners as it allows them to identify needed
features); (2) explicit teaching (focus on particular pronunciation features
and how they contrast with other sounds); (3) opportunities for interactions
with the computer, including the ability for learners to speak and analyze
their own production; (4) comprehensible and accurate feedback (e.g., visual
feedback that uses forms and symbols with which learners are familiar); and
(5) the development of strategies for learners to gain an understanding of new
features on their own, outside of the language learning or classroom
environment.

The main goal of this study is to explore the use of mobile ASR as a
pedagogical tool to improve the pronunciation teaching and learning of L2
French. In our investigation, we focus on the acquisition of the French
phoneme /y/ (orthographically represented as “u”, as in “tu” /ty/ ‘you (2nd
person singular)’) for two main reasons: (1) the sound is very difficult to
acquire in both production and perception (e.g., Baker and Smith, 2010; Levy
and Law II, 2010; Rochet, 1995); and (2) it has a high functional load (as
defined by Brown, 1991 and King, 1967) since it is used to distinguish many
highly frequent minimal pairs in French, such as “tu” /ty/ ‘you (2nd person
singular pronoun)’ and “tout” /tu/ ‘all, everything’, and “au-dessous”
/od.su/ ‘below’ and “au-dessus” /od.sy/ ‘above’.

Denis Liakin, Walcir Cardoso and Natallia Liakina

To our knowledge, there are no studies that have investigated the use of 
ASR on mobile devices for pronunciation teaching and/or learning (see also 
Godwin-Jones (2009) for a similar observation), including the development of 
production and perception. 

Background 

ASR and the Acquisition of Second Language Pronunciation 

There are three categories of ASR systems (Rosen and Yampolsky, 2000; Young
and Mihailidis, 2010), which are differentiated by the degree of user training
required prior to use: (1) speaker dependent; (2) speaker independent; and (3)
speaker adaptable. Speaker dependent ASR requires the user to train the speech
recognizer with samples of his/her own speech; consequently, the system works
well only for the person who trains it. Speaker independent ASR does not
require speaker training prior to use because the recognizer is pre-trained
during system development with speech samples from a variety of speakers. Many
different speakers will thus be able to use the same ASR application with
relatively good accuracy as long as their speech falls within the range of the
collected samples. Speaker adaptable ASR is similar to speaker independent ASR
in that no initial speaker training is required prior to use. However, unlike
speaker independent ASR systems, as the speaker adaptable ASR system is used
over time, the recognizer gradually adapts to the speech of the user.

Another way of characterizing ASR technology is by the type of input that the
system can handle: (1) isolated/discrete word recognition; (2) connected word
recognition; and (3) continuous speech recognition (see Jurafsky and Martin,
2008; Rabiner and Juang, 1993; Rosen and Yampolsky, 2000). Discrete word
recognition requires a pause or period of silence to be inserted between words
or utterances. Connected word recognition is an extension of discrete word
recognition and requires a pause or period of silence after a group of
connected words has been spoken. In continuous speech recognition, an entire
phrase or complete sentences can be spoken without the need to insert pauses
between words or after sentences. Different combinations of these ASR types
have been utilized in the CAPT/ASR literature. In our experiment, we adopted a
speaker independent system designed for continuous speech recognition, as will
be described in the forthcoming methodology section.

The majority of the studies that have investigated the effects of ASR on the
acquisition of L2 pronunciation have shown that, despite many limitations,
this technology has the potential to be effective. Early exploratory work by
Dalby and Kewley-Port (1999), LaRocca et al. (1999), and Mostow and Aist
(1999) indicated that ASR technology was still not as accurate as human
analysis and, consequently, they suggested that the software could be useful
for student practice with only certain aspects of pronunciation: segmental
features. More recent developments in ASR designed particularly for language
learning have shown the effectiveness of the technology for L2 pronunciation
training (e.g., Cucchiarini et al., 2009; Neri et al., 2008; Strik et al.,
2009).

A different type of ASR can be found in the form of dictation software:
computer applications that allow users to speak freely as the application
transcribes what they say (e.g., NCH Express Dictate, Nuance Dragon Speech
Recognition). Although not as commonly researched, early studies using this
off-the-shelf technology include those of Coniam (1999) and Derwing et al.
(2000), who evaluated its recognition performance for English. The authors
demonstrated that while dictation software offered positive results for native
English speakers (90% accuracy), it performed less well for non-native
speakers, leading the authors to conclude that the technology was not mature
enough for use in L2 learning.

A critique of this type of ASR appeared in Neri et al. (2003), where the
authors described the inadequacies of dictation software: the technology was
developed to recognize native speech only and, as such, did not include any
mechanism to provide feedback on pronunciation quality. Accordingly, these
dictation packages performed poorly with non-native speakers because of the
acoustic variations found in their speech. Specially designed ASR systems, on
the other hand, have better recognition performance with non-native speech
because their underlying acoustic models are prepared to accept the
mispronunciations that language learners are expected to make.

In sum, the available literature suggests that ASR technology may have
positive effects on the acquisition of L2 pronunciation. With regard to
dictation software, despite the pessimistic results obtained in the studies
conducted over a decade ago, our experience with current ASR systems suggests
that the technology has advanced considerably in the detection of non-native
speech. We thus hypothesize that ASR software designed for dictation could be
beneficial for pronunciation training, and that learners will also benefit
from the technology if it is offered in a portable format.

Mobile technology and second language acquisition 

The use of mobile devices such as smartphones, media players and camcorders
for language learning has sparked the interest of an increasing number of
researchers over the last decade, particularly in the field of vocabulary
acquisition (e.g., Kiernan and Aizawa, 2004; Kennedy and Levy, 2008; Lu, 2008;
Zhang et al., 2011). Although teachers and parents often consider mobile
devices a distraction in the classroom, these studies suggest that the devices
can be useful for language learning. In addition, their multimedia
capabilities can help students have more authentic learning experiences,
situating learning within their cultural and linguistic schemata (Joseph and
Uther, 2009).

Despite encouraging results, Kukulska-Hulme and Shield (2008) observed that
Mobile-Assisted Language Learning has not yet been embraced on a large scale
and has not yet received sufficient research attention toward its full
potential as a pedagogic practice. Along the same lines, Joseph and Uther
(2009) stressed that the value of using mobile devices and incorporating
multimedia elements into language learning applications needs to be quantified
with controlled experiments, where the control groups study on non-mobile
platforms or in mobile contexts with non-technical support, e.g., via paper
flashcards. According to these two authors, experiments of this sort should be
prioritized in future research. The current study addresses this
recommendation by incorporating a control and a comparison (teacher-driven)
group with characteristics similar to what these authors recommended.

Consistent with Godwin-Jones’ (2009) observation, as indicated earlier, we are
not aware of any study that investigates the use of ASR on mobile devices and
its effects on L2 pronunciation. To assess the viability of using mobile ASR
technology and to test its effects on learning, we chose to focus on the
acquisition of L2 French /y/.

French /y/ and its acquisition: Production, perception, and
functional load

The target French pronunciation feature examined in this study was the vowel
/y/. This is an ideal target phoneme for pronunciation instruction because, as
mentioned earlier, /y/ is highly problematic for L2 learners in both
production and perception (Baker and Smith, 2010; Levy and Strange, 2008). One
possible explanation for why this phoneme is so difficult to acquire lies in
how it is perceived by L2 learners whose languages lack /y/ in their
phonological inventories.

In first language acquisition, the most accepted assumption is that perception
must precede production, because while children are assumed to quickly develop
adult-like competence (Hale and Reiss, 1998; Stampe, 1973), their articulators
do not follow the same rate of development. For L2 acquisition, on the other
hand, the issue is not as straightforward, and has led researchers to stand on
one or more sides of three logical hypotheses regarding the relationship
between perception and production: (1) perception precedes production (e.g.,
Flege, 1995; Borden et al., 1983); (2) production precedes perception (e.g.,
Sheldon and Strange, 1982; Sheldon, 1985); and (3) production and perception
develop simultaneously (e.g., Flege, 1999; Koerich, 2006).

The common denominator among these hypotheses is that they recognize the
importance of L1 phonotactic knowledge, against which all foreign features are
‘filtered’ and subsequently categorized into a language system
(interlanguage). This notion has given rise to at least two models for speech
perception and L2 learning in general: Flege’s (1995) Speech Learning Model
(SLM) and Best’s (1993; 1995) Perceptual Assimilation Model (PAM). Leaving
aside conceptual differences between these models, Flege’s SLM postulates that
phonetically similar L2 sounds are more likely to be perceived via the L1 than
those that are dissimilar, possibly due to perceptual salience in the latter
case. In the case of similar sounds, the foreign segment (or phonetic feature)
is subsumed within the existing perceptual representation for a comparable
sound in the L1, which as a result leads to a so-called foreign accent.
Similar predictions are also made by Best’s PAM, which proposes that
‘non-native segments [...] tend to be perceived according to their
similarities to, and discrepancies from, the native segmental constellations
that are in close proximity from them in phonological space’ (p. 193). In
Best’s view, novel sounds are assumed to be either assimilated to a native L1
category (in the case of similar sounds) or to an uncategorizable sound that
will form a new category (in the case of dissimilar sounds). To summarize,
these two models predict that new L2 segments that are perceptually similar
(assimilable) will be more difficult to acquire than those that are dissimilar
(unassimilable).

It is premature and beyond the scope of this investigation to provide a
definite answer to the question regarding the nature of French /y/ as
‘filtered’ by the different L1 phonologies included in this study (see the
method section): Does the representation of French /y/ pattern with the
similar or the dissimilar scenarios predicted by the SLM and PAM models?
However, based on the high degree of difficulty that many French L2 learners
have in acquiring /y/ (e.g., Baker and Smith, 2010; Levy and Law II, 2010;
Rochet, 1995), and on the fact that the L1s considered in this study have
‘assimilable’ equivalents of /y/ (e.g., /u/ for English, Farsi, and Spanish
speakers, and /i/ for Portuguese speakers; see forthcoming discussion), it is
reasonable to conjecture that /y/ can be subsumed under the similar pattern.
Accordingly, these L1 speakers will categorize French /y/ based on their L1
phonotactic knowledge of a similar L1 phoneme: /i/ or /u/.

In addition to its difficulty in production and perception, French /y/ has a
high functional load (King, 1967), a concept used to describe the extent and
degree of contrast between linguistic units, usually phonemes. In phonology,
it is a measure of the work that two phonemes do to maintain phonemic contrast
in all possible environments, involving minimal pairs whose members are both
frequent (Brown, 1991). Consequently, certain phonemes in a language have a
higher functional load than others depending on the degree to which they
contrast in meaning. For instance, French /u-y/ is used to distinguish highly
frequent French minimal pairs such as au-dessous /odsu/ ‘below’ from au-dessus
/odsy/ ‘above’, a type of alternation that, due to its high frequency, may
affect intelligibility. Because many languages lack this highly functional
phoneme, it is essential that it be mastered early on so as not to compromise
meaning in the target language. This is one of the arguments that Jenkins
(2000; 2002) used in her rationale for proposing her version of the English as
a Lingua Franca approach, particularly in deciding priorities for
pronunciation teaching. According to the author, priority should be given to
sounds that have a high functional load, a requirement which we believe is
fulfilled by French /y/.
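
To make the notion of a /u/-/y/ minimal pair concrete, the following is a minimal sketch, not part of the study's materials. It searches a toy lexicon for word pairs whose transcriptions differ only in /u/ versus /y/. The transcriptions for tu, tout, au-dessus, au-dessous and pure come from the examples in the text; "pour" /pur/ is added here for illustration, and treating each character as one segment is a simplifying assumption. Functional load proper would also weigh word frequency, which this sketch ignores.

```python
from itertools import combinations

# Toy lexicon: spelling -> broad transcription (one character per segment).
# "pour" is an illustrative addition, not an item cited in the study.
LEXICON = {
    "tu": "ty",
    "tout": "tu",
    "au-dessus": "odsy",
    "au-dessous": "odsu",
    "pure": "pyr",
    "pour": "pur",
}


def is_minimal_pair(a, b, contrast=frozenset({"u", "y"})):
    """True if a and b differ in exactly one segment, and that segment
    pair is the given contrast (here /u/ vs /y/)."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and set(diffs[0]) == contrast


def minimal_pairs(lexicon):
    """Return all /u/-/y/ minimal pairs in the lexicon, alphabetically."""
    return sorted(
        tuple(sorted((w1, w2)))
        for (w1, t1), (w2, t2) in combinations(lexicon.items(), 2)
        if is_minimal_pair(t1, t2)
    )


pairs = minimal_pairs(LEXICON)
print(pairs)
```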

Research questions and predictions 

The purpose of the present study is to investigate the acquisition, in terms
of production and perception, of the French vowel /y/ in a mobile ASR-based
learning environment. It thus aims to examine the feasibility of using mobile
ASR as a pedagogical tool for L2 pronunciation learning. Accordingly, the
following two research questions guided our investigation:

1. Does ASR-based pronunciation practice using a mobile device improve
French L2 /y/ production?

2. Does ASR-based pronunciation practice using a mobile device improve
French L2 /y/ perception?

On the basis of the research discussed earlier, we hypothesized that ASR would
have a positive effect on /y/ production, since an explicit focus on
pronunciation in an ASR-based environment may improve learners’ production.
This assumption is consistent with the works of Neri et al. (2008) and
Cucchiarini et al. (2009), among others. With respect to the second question,
we predicted that learners would be able to extend the newly acquired
productive skill into perception, as has been attested (but less commonly) in
the literature (e.g., Aliaga-Garcia and Mora, 2009; Bradlow et al., 1997;
Jongman and Wade, 2007; but note that these studies focus on the effects of
phonetic training on the development of perceptual skills). For the purposes
of this study, we define perception as the participant’s ability to
discriminate between a set of options, namely /y/, /u/ and /i/, embedded in
words, phrases and sentences, as will be discussed later.

Method 

Participants 

Forty-two adult students of French as a second language participated in this
study, with an average age of 22 (30 female, 12 male). All participants were
recruited from three intact L2 French classrooms at two Anglophone
universities in Montreal. They were either native English speakers or had
native-like proficiency in the language (first languages: English: n = 27,
Farsi: n = 2, Spanish: n = 7, Brazilian Portuguese: n = 2, Chinese: n = 2,
Serbian: n = 1, Japanese: n = 1). All participants had elementary-level
proficiency in French (A2 level, according to the Common European Framework of
Reference for Languages; this is a requirement for enrolment in the
‘Elementary French’ classes from which the participants were recruited) and,
accordingly, had not yet fully acquired the target phoneme /y/. In line with
these requirements (i.e., A2 level proficiency and performance on the
pretest), data from eight students were discarded because they scored more
than 50% accuracy in /y/ production and perception on the pretest.

Design of the study, experimental groups, and treatment 

Following Chapelle’s (2001, 2012) recommendation for conducting
methodologically convincing CALL research, this study followed a mixed-methods
approach, using a pre/post research design (quantitative) followed by surveys
and interviews with the participants (qualitative). Due to the scope of this
study and its main goals, the focus will be primarily on the analysis of the
quantitative results.

The 42 participants recruited for this study were randomly assigned to one of
three groups: ASR, NASR (Non-ASR) and CTL (Control). During the treatment
period, the participants were not informed about the nature of the study,
except that it was about ‘an app that could help second language learners
improve their French’. Figure 1 illustrates the general design of this study,
which will be discussed in detail below.


Week 1: Pretest via CAN-8 [Production & Perception]

    ASR (n = 14):  5 ASR-based activities; feedback from mobile ASR
    NASR (n = 14): 5 teacher-instructed activities; feedback from teacher
    CTL (n = 14):  5 teacher-based interactions; no feedback

Week 5: Posttest via CAN-8 [Production/Perception] + Interviews


Figure 1. Design of the study. ASR = Automatic Speech Recognition group; NASR =
Non-ASR group; CTL = Control group.

The ASR Group practiced French pronunciation using mobile ASR on an iPod Touch
or iPhone with a commercial (but free) ASR application: Nuance Dragon
Dictation, a speaker independent dictation system designed for continuous
speech recognition, as described earlier. To test the accuracy of the
application, five native French speakers pronounced all words/phrases included
in the study using the university's wi-fi connection. One hundred percent of
utterances were recognized correctly. Over the five weeks, either at home or
at the university, the students completed five weekly 20-minute pronunciation
activities that consisted of reading aloud the target words and phrases in
French using the ASR software installed on their mobile devices. After each
reading attempt, students were provided with immediate written feedback via an
orthographic representation of their attempt. To illustrate, if students
attempted to pronounce the word ‘pure’ [pyr] and ‘pour’ or ‘pire’ appeared on
their screen as the written (visual) result, this indicated that their
pronunciation was incorrect, thus requiring another attempt. In some cases, a
slow internet connection or background noise would affect the results, but
students were aware of this limitation and were therefore asked to be patient,
try again on another network, or wait until they were on university premises
for a faster and more reliable wireless connection. The ASR participants were
asked to spend approximately one minute per word/phrase, depending on the
level of difficulty of each target phrase, for a total of 20 minutes. To
ensure that the participants completed the assigned ASR-based pronunciation
activities, they were also asked to indicate, on a ‘pronunciation form’ (see
Appendix), the number of times they repeated each form until they were able to
produce it accurately or until their one-minute limit had expired.
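
The feedback loop described above can be sketched in a few lines. This is an illustration under stated assumptions: in the study, learners themselves compared the on-screen transcript with the target word, and the code below does not call Dragon Dictation or any real ASR service. The function names (`normalize`, `check_attempt`, `attempts_until_correct`) are hypothetical, and the transcripts are supplied as plain strings.

```python
import unicodedata


def normalize(text):
    """Lowercase, trim whitespace and final punctuation, and apply Unicode
    NFC so that accented letters compare equal regardless of encoding."""
    text = unicodedata.normalize("NFC", text).strip().lower()
    return text.rstrip(".!?")


def check_attempt(target, transcript):
    """True if the transcript shown on screen matches the target form."""
    return normalize(target) == normalize(transcript)


def attempts_until_correct(target, transcripts):
    """Return the 1-based number of the first matching attempt, or None if
    no attempt matched (e.g. the one-minute limit expired first)."""
    for n, transcript in enumerate(transcripts, start=1):
        if check_attempt(target, transcript):
            return n
    return None


# The learner aims for 'pure'; the screen shows 'pour', then 'pire', then 'Pure'.
print(attempts_until_correct("pure", ["pour", "pire", "Pure"]))
```

The count returned corresponds to what participants recorded on the pronunciation form: the number of repetitions needed before the written result matched the target.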

The ‘Non-ASR Group’, on the other hand, did not have access to mobile ASR.
However, they completed the same activities that the ASR participants did:
They read aloud the same words and phrases in individual, weekly 20-minute
sessions with a French teacher who provided immediate oral feedback on their
pronunciation using recasts and repetitions. To accommodate the nature of the
intervention received by this group, the pronunciation form (Appendix) was
slightly adapted (e.g., the irrelevant question ‘what word/s do you see on the
screen?’ was removed). To keep the treatments comparable, the teacher was
asked not to volunteer any pronunciation practice that emphasized the target
phoneme /y/; instead, the teacher was asked to concentrate on the items listed
on the form and provide only the feedback necessary for the completion of the
activities.

Finally, the ‘Control Group’ participated in weekly individual 20-minute
meetings with the goal of practicing their conversation skills with a French
teacher who provided no feedback on /y/ pronunciation. These sessions could be
described as conversation classes, in which the participant and the teacher
engaged in discussions of a variety of topics about school, aspirations,
family, etc.

Table 1 summarizes the activities completed over the course of the study, the
focus of instruction, the type of feedback provided in each group, and the
length of each corresponding treatment.


Table 1. Experimental groups and related activities

          ASR                          NASR                           Control
Activity  Oral ASR activities          Oral teacher-based activities  Conversation with teacher
Focus     /y/ + distractors            /y/ + distractors              None
Feedback  Mobile ASR (written)         Teacher (oral)                 None
Length    Five 20-min weekly sessions  Five 20-min weekly sessions    Five 20-min weekly sessions

Tasks: Pretest and posttest 

For the production and perception tasks used to measure students’
pronunciation capabilities, we employed CAN-8 VirtuaLab, ‘an interactive,
multimedia tool used for the instruction of modern languages’ with which the
participants were familiar (they used it on a weekly basis in the university’s
language lab to complete general language activities).

The production task consisted of reading words and phrases aloud, which were
recorded automatically using CAN-8 in the university’s language lab, without
the presence of the researcher or teacher. We targeted 20 instances of /y/
(plus 15 distractors) in 19 keywords (one containing two instances of the
target phoneme) which were carefully selected so that /y/ occurred in equally
distributed syllabic environments: 10 in open, vowel-final syllable structures
(e.g., -du [dy] in ‘défendu’), and 10 in closed, consonant-final syllable
contexts (e.g., cul- [kyl] in ‘culture’). The words selected for the
production task were: assume, azur, chute, culture, défendu, fumes, lune,
musique, numéro, particule, perdu, plu, pulvérise, surtout, tu, ultime,
unanime, une, and vu. Figure 2 shows the CAN-8 interface illustrating a
production task:


[Screenshot of the CAN-8 ‘Read and record’ screen.]

On-screen instructions: Read the following phrases and sentences. To start the
recording, click Record. When you finish recording, click Pause. Click Play to
check the sound quality of your recording. Click Next to complete the
activity.

Sample items shown:
Hier, il a vu ses parents.
La lune est énorme cette nuit.
Fumes-tu des cigarettes?
Ma mère et mon père habitent à Paris.
J'assume une tâche difficile.
assister au spectacle
la chute de neige
petite particule

Figure 2. The production task: An example.




In the perception task, we employed pseudowords to avoid frequency and
familiarity effects; this was based on the assumption that the productivity of
a pattern is often determined by its frequency in the input, i.e., ‘the more
items encompassed by a schema, the stronger it is, and the more available it
is for application to new items’ (Bybee, 2001:13; see also Flege et al., 1996
for similar assumptions). To illustrate a possible frequency (and consequently
a familiarity) effect in the context of the study, some participants could
select the French word ‘tu’ [ty] as containing /y/ simply due to their
familiarity with the word as a consequence of its high frequency in their
language input (and possibly in their language output).

In the perception experiment, the participants listened to 45 monosyllabic
‘French’ pseudowords containing the vowels /y/, /u/ and /i/ (15 instances of
each vowel; e.g., fuppe [fyp], foupe [fup], fippe [fip]). The vowels /u/ and
/i/ were included as distractors even though they have been reported as the
most confusable vowels in the identification of French /y/. For instance,
while Anglophone listeners perceive the French /y/ as their L1 back /u/ vowel
(Gottfried, 1984; Rochet, 1995), Brazilian Portuguese and Haitian Creole
listeners perceive the same vowel as their own /i/ (Rochet, 1995).¹

The perception task followed a four-item multiple-choice format, with each
alternative representing one of the relevant three vowels described above, and
‘I don’t know’ as the fourth choice to minimize random selection. After
listening to a pseudoword such as fuppe [fyp], participants were asked to
choose the alternative that corresponded to what they heard. Figure 3
illustrates the interface of the perception task on CAN-8.
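
As an illustration of how responses in this four-choice format could be scored, the sketch below counts correct identifications of /y/ items and tallies what /y/ was confused with. It is a hypothetical reconstruction, not CAN-8's actual scoring routine: the function name, the response codes ("?" standing for ‘I don’t know’), and the sample trials are all made up for the example, and scoring only the /y/ items is an assumption.

```python
from collections import Counter

CHOICES = {"y", "u", "i", "?"}  # "?" stands for the 'I don't know' option


def score_perception(trials):
    """trials: iterable of (stimulus_vowel, chosen_option) pairs.

    Returns the number of /y/ stimuli identified correctly and a Counter of
    all responses given to /y/ stimuli (to show what /y/ is confused with).
    """
    correct = 0
    confusions = Counter()
    for stimulus, choice in trials:
        if choice not in CHOICES:
            raise ValueError(f"unexpected response: {choice!r}")
        if stimulus == "y":
            confusions[choice] += 1
            if choice == "y":
                correct += 1
    return correct, confusions


# Made-up responses for illustration only (not data from the study).
trials = [("y", "y"), ("y", "u"), ("u", "u"), ("i", "i"), ("y", "?"), ("y", "y")]
correct, confusions = score_perception(trials)
print(correct, dict(confusions))
```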



Figure 3. Perception task: An example.

Analysis

To assess the students’ production, two bilingual francophone RAs listened to
each student’s recordings and determined whether the pronunciation of /y/ was
correct or incorrect. In the case of conflicting opinions, a member of our
team listened to those occurrences and made the decision. If an item was
ambiguous or unclear, that item was excluded from the computation. In total,
there were 1,680 occurrences of /y/ with an inter-rater reliability of 88.7%
(1,490/1,680). Assessment of the students’ perception was done automatically
by CAN-8, which is programmed to assess each response as correct or incorrect
according to the stimulus input into the system.
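
The reliability figure above is a simple percent agreement, which can be checked directly. The sketch below reproduces the arithmetic; the decomposition of the 1,680 occurrences into 42 participants × 20 target tokens × 2 tests is our reading of the design rather than something the text states explicitly. Percent agreement is not corrected for chance, unlike statistics such as Cohen's kappa.

```python
# Assumed decomposition of the total number of rated /y/ tokens.
participants = 42
target_tokens = 20   # instances of /y/ in the production task
tests = 2            # pretest and posttest

total = participants * target_tokens * tests
agreed = 1490        # tokens on which the two raters agreed (from the text)

agreement = agreed / total
print(f"{agreed}/{total} = {agreement:.1%}")
```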

For the statistical analysis of the data and to test for differences among the
three groups on the pretest and posttest, a one-way ANOVA was performed at
each time point for production and perception. To test for differences within
each group over time, dependent samples t-tests were carried out to compare
the pretest to posttest performances for each group.
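
For readers who want to see what these two tests compute, the following is a minimal sketch using only the Python standard library. It returns test statistics and degrees of freedom, not p-values, and the small score lists are invented for illustration; they are not the study's data. The function names are ours.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """t statistic and degrees of freedom for a dependent (paired) samples
    t-test on matched pretest/posttest scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def one_way_anova_f(*groups):
    """F statistic with between- and within-group degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Invented scores, for illustration only.
t, df = paired_t(pre=[5, 7, 6, 9], post=[8, 9, 7, 12])
f, df1, df2 = one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(round(t, 2), df, f, df1, df2)
```

With three groups of 14 participants, the same function yields 2 and 39 degrees of freedom, matching the F(2, 39) values reported in the Results.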

Results 

The general descriptive statistics of the analysis for /y/ production and
perception appear in Table 2. The mean scores (M) of accurate production and
perception are presented as well as the standard deviations (SD) across the
two tests (pretest and posttest) and the three groups under consideration
(ASR, Non-ASR and Control). Because there were ten tests performed, the alpha
level had to be adjusted and set at 0.005 (0.05/10 tests). Overall, the
results of the one-way ANOVA indicated no differences among the three groups
either on the pretest or the posttest in both /y/ production (F (2, 39) =
0.95, p = 0.392 and F (2, 39) = 0.90, p = 0.413 in pre- and posttest
respectively) and /y/ perception (F (2, 39) = 1.57, p = 0.221 and F (2, 39) =
0.32, p = 0.731 in pre- and posttest respectively).
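
The alpha adjustment described above divides the family-wise level by the number of tests (a Bonferroni-style correction). The sketch below applies the adjusted threshold to the pretest-to-posttest p-values reported for /y/ production. The ASR value is reported only as p < .001, so 0.001 is used as an upper bound; that substitution is an assumption of the example.

```python
family_alpha = 0.05
n_tests = 10
alpha = family_alpha / n_tests  # adjusted per-test threshold: 0.005

# Pre/post p-values for /y/ production as reported in the text and Figure 4.
# ASR is given only as "p < .001"; 0.001 is used here as an upper bound.
reported = {"ASR": 0.001, "Non-ASR": 0.02, "Control": 0.38}

significant = {group: p < alpha for group, p in reported.items()}
print(alpha, significant)
```

Under the adjusted threshold, the Non-ASR group's p = .02 does not count as significant, which is why only the ASR group is described as having improved.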

Table 2. Descriptive statistics for /y/ production and /y/ perception over
time, across the three groups (mean scores)

      Production (n = 20)                       Perception (n = 15)
      ASR           Non-ASR       Control       ASR           Non-ASR       Control
Test  M      SD     M      SD     M      SD     M      SD     M      SD     M      SD
Pre   7.09   5.51   9.79   3.98   8.43   5.86   8.07   3.64   9.64   3.67   10.29  2.81
Post  10.71  4.23   11.86  3.98   9.50   5.53   9.93   2.89   10.29  3.77   10.93  3.40
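
To make the size of the changes in Table 2 easier to compare, the short sketch below computes each group's mean gain (posttest mean minus pretest mean) from the tabled values. It is descriptive arithmetic only and says nothing about significance; the variable names are ours.

```python
# (pretest mean, posttest mean) per group, copied from Table 2.
means = {
    "production": {"ASR": (7.09, 10.71), "Non-ASR": (9.79, 11.86), "Control": (8.43, 9.50)},
    "perception": {"ASR": (8.07, 9.93), "Non-ASR": (9.64, 10.29), "Control": (10.29, 10.93)},
}

# Mean gain = posttest mean - pretest mean, rounded to two decimals.
gains = {
    measure: {group: round(post - pre, 2) for group, (pre, post) in groups.items()}
    for measure, groups in means.items()
}

for measure, by_group in gains.items():
    print(measure, by_group)
```

The ASR group shows the largest mean gain on both measures (3.62 of 20 in production, 1.86 of 15 in perception).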


To test for differences within each group over time, dependent samples t-tests
were carried out comparing pretest performance to posttest performance for
each group. In this analysis, only the ASR Group improved significantly from
pretest to posttest in /y/ production (p < 0.001), and no group improved in
/y/ perception.² The following two sections will provide a detailed report of
each of these sets of results.

Production of French /y/

The first research question asked whether ASR-based pronunciation practice
using a mobile device would improve French L2 /y/ production. According to
the results of the dependent samples t-tests, only the ASR group improved significantly
from pretest to posttest (p < 0.001). This indicates that learners who
received instruction via the mobile ASR application displayed more improvement
over time than those who received teacher-based input and feedback
(Non-ASR) or no input or feedback whatsoever (Control). As such, these
results support our initial hypothesis that the pedagogical use of a mobile version
of ASR would have a positive effect on /y/ production.

For illustrative purposes, Figure 4 presents the mean scores for accurate
/y/ production.


[Figure 4: bar chart of mean pretest and posttest scores for accurate /y/ production
in the ASR, Non-ASR and Control groups. Pre/posttest differences: ASR, p < .001;
Non-ASR, p = .02; Control, p = .38.]

Figure 4. /y/ production results
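
The adjusted alpha level reported in the Results (0.05/10 = 0.005) explains why the Non-ASR gain, with p = .02 in Figure 4, is not counted as significant. The following Python sketch illustrates that decision rule. It assumes the p-values shown in Figure 4 and treats the ASR group's 'p < .001' as an upper bound of 0.001; the variable names are illustrative only.

```python
FAMILYWISE_ALPHA = 0.05
NUMBER_OF_TESTS = 10  # ten tests were performed in the study

# Bonferroni-style adjustment: divide the familywise alpha by the number of tests.
adjusted_alpha = FAMILYWISE_ALPHA / NUMBER_OF_TESTS

# Pre/posttest p-values for /y/ production as shown in Figure 4.
# The ASR value is an upper bound (reported as p < .001).
production_p_values = {"ASR": 0.001, "Non-ASR": 0.02, "Control": 0.38}

significant = {
    group: p_value < adjusted_alpha
    for group, p_value in production_p_values.items()
}

print(f"adjusted alpha = {adjusted_alpha:.3f}")
for group, is_significant in significant.items():
    print(group, "significant" if is_significant else "not significant")
```

Under an unadjusted alpha of 0.05 the Non-ASR gain would also have counted as significant; the correction for multiple tests is what restricts the significant improvement to the ASR group.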

Perception of French /y/

The second research question asked whether ASR-based pronunciation practice
using a mobile device would improve French L2 /y/ perception. The results
of the dependent samples t-tests, illustrated in Figure 5, indicate that despite
slightly greater gains for the ASR group, the three groups behaved in a similar
way (pre/posttest differences: ASR: p > 0.05; Non-ASR: p > 0.38; Control: p >
0.37). This indicates that the group that received ASR-based treatment was not
able to extend the newly acquired knowledge detected in production to perception;
accordingly, our initial hypothesis was not supported.

Figure 5. /y/ perception results

Discussion 

The main goal of this study was to explore the use of ASR software on mobile
devices as a pedagogical tool for improving L2 French pronunciation in production
and perception. With regard to production, the results indicate that,
similar to what is observed in the general (non-mobile) ASR literature, the use
of mobile speech recognition appears to have a positive effect on the acquisition
of the French vowel /y/ (see also Cucchiarini et al., 2009 and Neri et al.,
2008 for similar results involving segments). We attribute these learning gains
to a variety of factors that include insights from the general SLA/CALL literature,
notably Chapelle’s (2001) ideas about input enhancement and computer-aided
interaction (e.g., /y/ pronunciation was reinforced via orthography, input
manipulation and repetition among ASR users), the effects of an explicit focus
on the target form (Dabaghi, 2010; Dekeyser, 1993), immediate feedback (Rosa
and Leow, 2004), multiple opportunities for learning (Christison, 1999; Chun
and Plass, 1996), and the game-like approach to teaching afforded by mobile
technologies (Bruff, 2009). Lastly, mobile ASR technology, as utilized in this
study, adheres to Chapelle and Jamieson’s (2008) suggestions for selecting pronunciation
software to develop speaking skills, based on research by Hardison
(2004; 2005), Derwing et al. (1998), and MacDonald et al. (1994). Accordingly,
the mobile ASR technology adopted in this study provides: learner fit (it emphasizes
a feature that the participants needed to improve); potential for explicit
teaching and learning; opportunities for interactions with the computer; comprehensible
and visual (orthographic) feedback; and strategy development to
guide students to start learning new L2 features on their own outside of the language
learning environment. We are aware, however, that the observed gains
could also be caused by the effect of the adoption of a new technology, which
may have increased the overall interest and motivation of the students (Clark,
1983; Strambi, 2001; Warschauer, 1996).

Regarding perception, our results indicate that L2 learners were not able
to transfer the acquired knowledge about /y/ production into perception. We
attribute this result to at least two main factors. First, it is possible that the total
of 1.5 hours of instruction was not sufficient for learners to acquire /y/ in perception
and thus locate this foreign phoneme within the phonological system
that characterizes their L1s. As discussed earlier, this phoneme is highly complex
from both an acoustic and articulatory perspective (Baker and Smith,
2010; Levy and Strange, 2008); this may affect its acquisition in perception,
particularly in an experiment in which no emphasis was given to the development
of perceptual skills. Secondly, we admit that we were originally optimistic
about our conjecture that a focus on production could translate into
gains in perception, as has been argued in studies that focus on the effects of
phonetic training on the development of perceptual skills (e.g., Aliaga-Garcia
and Mora, 2009; Bradlow et al., 1997). Instead, the results obtained in our
study seem to conform to those related to the acquisition of /r/ and /l/ by Japanese
learners of English (e.g., Hattori, 2009; Sheldon and Strange, 1982). In
these studies, L2 learners were able to produce these two English liquids more
reliably than they were able to perceive them. In Hattori (2009), for instance,
Japanese learners could be trained to produce native-like English /r/ and /l/
approaching 100% accuracy, while the same learners could not distinguish
between the two phonemes in perception experiments. In sum, along the lines
of Hattori (2009) and Sheldon and Strange (1982), our findings seem to suggest
that speech production can sometimes precede its perception (at least for
the acquisition of /y/ in an ASR-based context), as the participants in the ASR
group improved only in the former. However, based on the general trends
observed (e.g., the ASR group did outperform the other two groups, but not
significantly), the results point in an optimistic direction regarding the use of
mobile ASR for the development of speech perception.

Concluding remarks 

The present study revealed a significant improvement in /y/ production by the
group that trained in an ASR-based environment. The overall success of this
group on the production measures suggests that this type of learning environment
is propitious for the development of L2 French /y/ and, we speculate, for
the development of other related segmental features. This has both theoretical
and practical relevance.

With regard to its theoretical contribution, albeit limited in scope, the study
initiates a debate on the potential and feasibility of using mobile ASR technology
for the teaching and learning of L2 segments, particularly French /y/. In
addition, the results obtained reinforce, in the context of mobile technology,
some of the well-established notions in the CALL and SLA literature.
As discussed earlier, these include learner autonomy (Holec, 1981; Schwienhorst,
2008), immediate feedback (Rosa and Leow, 2004), explicit instruction
(Dekeyser, 1993), input enhancement (Chapelle, 2001), and multimodal
exposure to the forms being acquired (Christison, 1999).

From a pedagogical standpoint, we believe that ASR software on mobile
technology should be further explored as a potential complement to pronunciation
activities conducted in language classrooms: It may not only promote
the acquisition of segments, as demonstrated in this study, but it can also be
used by teachers and students without much preparatory work (contrary to
the types of specially-designed ASR systems used in studies such as those of
Neri et al., 2006), and it provides a type of feedback that is easily understood,
via orthography. In the classroom, teachers could emphasize meaningful communicative
tasks, as recommended by L2 pedagogues (e.g., Littlewood, 2004;
Nunan, 2004), while assigning certain pronunciation tasks (for instance, those
that are repetitive and require a special focus on articulation) as personalized
homework assignments. Those tasks could target particular pronunciation
problems such as French /y/: it is difficult to produce and perceive; it
requires the articulation of ‘funny lip-rounding’ which may inhibit shy students
in public environments; and it has a high functional load, meaning that
its mispronunciation is likely to affect intelligibility. Accordingly,
we believe that ASR can and should be used in the language learning environment
because: (1) it has the potential to improve L2 learners’ pronunciation,
as we have shown here; (2) it can reallocate resources so that classroom time
can be used exclusively (or mostly) for communicative activities; (3) it accommodates
a wide variety of learners (e.g., those who benefit from the visual
interactions afforded by ASR; Gardner, 1983); and, finally, (4) the technology
was evaluated very positively by the participants, as indicated by the following
samples of participants’ responses:

• It is perceived as having a positive effect on pronunciation (‘[...] will
help you pronounce better’; ‘They should definitely implement that in
the grammar classes because it’s like you need to know how to pronounce
things’);

• It provides immediate visual feedback (‘You pronounce and you see
right away what you’re pronouncing’; ‘It gives you the answers, you can
see’);

• It is portable and convenient (‘It’s good to have homework that you
can take home and practice pronouncing on your own, instead of just
in the lab’);

• It provides a different modality to learn (‘So, yeah. Especially when
there’s no one else, [these are] different exercises. It’s better, yeah’); and

• It encourages practice (‘I get nervous when it’s, like, in person. So, it’s
definitely easier, and then I can get more comfortable, and I don’t mess up
so much, in person if I... you just get more confident’) (see Note 3).

We are aware that it is premature to arrive with certainty at generalizable
conclusions in a study of such narrow scope (e.g., focus on a single phoneme,
involving participants recruited from three intact classes in two universities)
and in a field that is still in its infancy (mobile ASR-based technology for pedagogical
purposes). As such, there are some limitations that deserve special
consideration in future investigations. One of the major limitations of this
study, as alluded to above, is its limited contribution to the field, as it investigates
the acquisition of a single phoneme in French: /y/. Two important questions
remain for further investigation: Will other phonological or phonetic items such
as features (e.g., spread glottis, voice-onset time), syllable structure (e.g., codas),
rhythm and intonation benefit from a similar (mobile) ASR treatment? What
is the impact of ASR-based training on overall pronunciation skills (e.g., the
development of intelligibility)? Another limitation of this study relates to what
is referred to as the novelty effect. As has been attested in the computer-assisted
learning literature (e.g., Nikolova, 2002; Warschauer, 1996), there is the possibility
that the gains observed in the ASR group are ephemeral, merely a reflection
of what Clark (1983) defines as the novelty effect, wherein it is assumed that the
improved performance observed is a response to the increased interest in the
new technology, and not necessarily a direct influence of its use. Similarly, it is
also possible that the improved learning observed in the ASR Group was affected
by the instructional methods associated with the use of this new technology, as
discussed above (e.g., the development of learner autonomy, encouragement of
repetition, presence of immediate feedback and multimodal exposure). Only
an extensive longitudinal study, conducted after the novelty factor has worn off,
will be able to address this concern.

Some of the methodological limitations of this study include: the short
duration of the treatment and training sessions, the linguistic heterogeneity
of the three groups of participants whose first languages differed (while most
were native English speakers, some were multilingual), and the small number
of participants. This latter limitation was mostly due to the fact that the majority
of our participants did not own an appropriate device to participate in the
study and, to a lesser extent, to participant attrition (one participant withdrew
from the experiment due to illness).

A potential direction for future research is to adapt and/or develop mobile
technologies that can address the full spectrum of what linguistic competence
in an L2 entails. According to Dickerson (2004, 2013), this knowledge
includes learners’ ability to perceive (e.g., distinguish /u/ from /y/), produce
(e.g., articulate /y/), and predict pronunciation patterns (based on grapheme-to-phoneme
rules: e.g., while orthographic ‘u’ is pronounced as [y], ‘ou’ and its
orthographic variants such as ‘oup’ and ‘out’ are produced as [u]). These three
competence elements or ‘trilogy of goals’ (prediction, production, perception)
can be easily explored in a mobile-assisted learning environment via a combination
of tools/apps that promote the development of production (ASR), perception
(text-to-speech synthesizers, TTS; see Note 4), and prediction (ASR and TTS).
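
The ‘prediction’ component can be illustrated with a small Python sketch of the grapheme-to-phoneme rule quoted above (orthographic ‘u’ as [y], ‘ou’ as [u]). This is a deliberately rough illustration and not a tool used in the study. The function names are hypothetical, and the handling of ‘eu’, ‘au’, ‘qu’ and nasal ‘un’/‘um’ is an added assumption, needed so that items such as ‘un peu’ are not wrongly flagged. Exceptions such as the participle ‘eu’, which is pronounced [y], are not covered.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics so that 'sûr' and 'durée' are treated like 'sur' and 'duree'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def predict_high_rounded_vowels(word: str) -> set:
    """Return the subset of {'y', 'u'} predicted for one orthographic word."""
    w = strip_accents(word.lower())
    predicted = set()
    if "ou" in w:                      # 'ou', 'oup', 'out', ... -> [u]
        predicted.add("u")
    rest = w.replace("ou", "_")        # mask 'ou' so its 'u' is not counted again
    for match in re.finditer("u", rest):
        i = match.start()
        previous = rest[i - 1] if i > 0 else ""
        if previous in ("e", "a", "q"):
            continue                   # assumed exclusions: 'eu', 'au', 'qu'
        if re.match(r"[nm]([^aeiouyhnm]|$)", rest[i + 1:]):
            continue                   # assumed exclusion: nasal 'un' / 'um'
        predicted.add("y")             # remaining bare 'u' -> [y]
    return predicted


def predict_phrase(phrase: str) -> set:
    """Union of the predictions for every word in a phrase."""
    result = set()
    for word in phrase.split():
        result |= predict_high_rounded_vowels(word)
    return result


for item in ["pour", "lecture", "nous avons pu", "manger du poulet", "un peu", "il est sûr"]:
    print(item, "->", sorted(predict_phrase(item)))
```

Items such as ‘nous avons pu’ and ‘manger du poulet’ contain both vowels, which makes them useful for practicing the /u/ versus /y/ contrast within a single utterance.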

Notes 

1. These types of patterns and observations led many L2 researchers to propose different
models for second language speech perception. The two most notable ones are Flege’s (1995)
Speech Learning Model, discussed earlier, and Best’s (1993, 1995) Perceptual Assimilation
Model. For Best (1995), ‘non-native segments ... tend to be perceived according to their similarities
to, and discrepancies from, the native segmental constellations that are in close proximity
from them in phonological space’ (p. 193).

2. A finding of no significant difference on the posttest does not mean that there could not
be a significant change over time in each group. In our case, the interpretation of the results is a
bit more subtle. The ASR group had the lowest mean at the pretest, but with such large within-group
variability, as evidenced by the standard deviation, no significant difference was found
between it and the other two groups. On the posttest, the ASR group did show the greatest gain
in mean score and its within-group variability had decreased somewhat too. One could argue
that this result was caused by the fact that the ASR group had more room to improve, since it had
the lowest (but not significantly different) mean at the pretest. We must note, however, that the
pretest to posttest movement for the other two groups was not limited by a ceiling effect. Only
one subject in these two groups achieved 19 out of 20 on the posttest. The next highest score in
both of these groups was 17 out of 20. Examining individual scores, the more dramatic change
for the ASR group may be attributed to the huge change in scores for four of five very low scorers
on the pretest (1 > 10, 2 > 10, 1 > 9, 2 > 7). Only one low scorer in each of the other two groups made
a similar improvement from pretest to posttest (Non-ASR: 1 > 9; Control: 4 > 9).

3. According to the participants, the two main weaknesses of the ASR system adopted are
the unreliability of the internet connection, particularly during peak periods and in weak spots
within the university premises (e.g., “Sometimes the app wouldn’t work because of my bad connection
and I’d get frustrated”), and the level of pronunciation accuracy required by the app (e.g.,
“Sometimes I would pronounce it correctly, and my boyfriend is from Paris, and he would say ‘yeah,
that’s right’, and then he would say it and it still didn’t come out”). We suspect that comments similar
to the latter can sometimes be attributed to the effects of a faulty internet connection.

4. A text-to-speech synthesizer (TTS) is a computer program/app that generates speech
from any written text automatically. TTS programs feature different speed levels of the speech
output, both female and male speakers with different pitches (low and high), different accents
of language varieties, and a highlight function that displays the words, sentences and paragraphs
being read by the program in color. The quality of the synthesis has improved substantially over the
years (Handley, 2009), and we believe that this is an appropriate time to start exploring this computer
application, in a mobile environment, as a potential model for L2 speech. The main advantage
of TTS is that it can be used as a means of enhancing the L2 aural input and, therefore, help learners
perceive some of the phonetic properties of /y/ as well as the acoustic differences between this
vowel and equivalent L1 forms such as /u/ and /i/. This could potentially help learners decipher the
intricacies of assigning a ‘phonological space’ to foreign /y/ in their developing phonologies.

Appendix. Sample of pronunciation form used in weekly
activities by the ASR Group

Using Speech Recognition          Week 1          Name ________

Pronunciation form
(Please, spend 1 minute per word/expression)

Word                 # of attempts   Succeeded? (Yes/No)   If “No”, what word(s) do you see on the screen?
radio
pour
tu es grand
lire
amour
une vie agréable
pur / pure
cours
maman
tour
partir
lecture
nous avons pu
une grande table
manger du poulet
cours de français
un peu
il est sûr
courir
durée
partout